Supplemental content output

ABSTRACT

Techniques for generating a personalization identifier that is usable by a skill to customize output of supplemental content to a user, without the skill being able to determine an identity of the user based on the personalization identifier, are described. A personalization identifier may be generated to be specific to a skill, such that different skills receive different personalization identifiers with respect to the same user. The personalization identifier may be generated by performing a one-way hash of a skill identifier, and a user profile identifier and/or a device identifier. User-perceived latency may be reduced by generating the personalization identifier at least partially in parallel to performing ASR processing and/or NLU processing.

BACKGROUND

Natural language processing systems have progressed to the point where humans can interact with computing devices using their voices and natural language textual input. Such systems employ techniques to identify the words spoken and written by a human user based on the various qualities of received input data. Speech recognition combined with natural language understanding processing techniques enables speech-based user control of computing devices to perform tasks based on the user's spoken inputs. Speech recognition and natural language understanding processing techniques may be referred to collectively or separately herein as spoken language understanding (SLU) processing. SLU processing may be used by computers, hand-held devices, telephone computer systems, kiosks, and a wide variety of other devices to improve human-computer interactions.

BRIEF DESCRIPTION OF DRAWINGS

For a more complete understanding of the present disclosure, reference is now made to the following description taken in conjunction with the accompanying drawings.

FIG. 1 is a conceptual diagram illustrating a system configured to generate and use personalization identifiers for customizing output of supplemental content, according to embodiments of the present disclosure.

FIG. 2 is a conceptual diagram illustrating example data that may be stored in a personalization identifier cache, according to embodiments of the present disclosure.

FIG. 3 is a conceptual diagram illustrating example data that may be stored in a skill configuration storage, according to embodiments of the present disclosure.

FIG. 4 is a process flow diagram illustrating processing that may be performed by a personalization identifier component, according to embodiments of the present disclosure.

FIG. 5 is a process flow diagram illustrating processing that may be performed by the personalization identifier component, according to embodiments of the present disclosure.

FIG. 6 is a conceptual diagram of components of a device, according to embodiments of the present disclosure.

FIG. 7 is a block diagram conceptually illustrating example components of a device, according to embodiments of the present disclosure.

FIG. 8 is a block diagram conceptually illustrating example components of a system, according to embodiments of the present disclosure.

FIG. 9 illustrates an example of a computer network for use with the overall system, according to embodiments of the present disclosure.

DETAILED DESCRIPTION

Automatic speech recognition (ASR) is a field of computer science, artificial intelligence, and linguistics concerned with transforming audio data associated with speech into a token or other textual representation of that speech. Similarly, natural language understanding (NLU) is a field of computer science, artificial intelligence, and linguistics concerned with enabling computers to derive meaning from natural language inputs (such as spoken inputs). ASR and NLU are often used together as part of a language processing component of a system. Text-to-speech (TTS) is a field of computer science concerning transforming textual and/or other data into audio data that is synthesized to resemble human speech.

A system may receive a user input (e.g., a spoken input), process the user input to determine an intent representing the user input, and invoke a skill to perform an action responsive to the user input based on the determined intent. As used herein, a “skill” may refer to software that may be placed on a machine or a virtual machine (e.g., software that may be launched in a virtual instance when called), configured to process at least a NLU hypothesis (including an intent and optionally one or more entities), and perform one or more actions in response thereto. For example, in response to the user input “play music by [artist],” a music skill may output music sung by the indicated artist; in response to the user input “turn on the lights,” a smart home skill may turn on “smart” lights associated with a user or group profile; in response to the user input “what is the weather,” a weather skill may output weather information for a geographic location corresponding to the device that captured the user input; etc. In the foregoing examples, actions correspond to the outputting of music, the turning on of “smart” lights, and the outputting of weather information. A skill may operate in conjunction with various components of a system, such as user devices, restaurant electronic ordering systems, taxi electronic booking systems, etc. in order to complete certain functions. What is referred to herein as a skill may sometimes be referred to as an application, bot, or the like.

Sometimes, in addition to performing an action responsive to the user input, the skill may output supplemental content (e.g., an advertisement not directly responsive to the user input). For example, when the system receives the user input “play music by [artist name],” the skill may, prior to outputting music sung by the indicated artist, output supplemental content informing the user of pay-for functionality provided by the skill. For further example, the skill may output supplemental content indicating functionality provided by another skill, content indicating a product that may be purchased, content detailing an upcoming sporting event, etc.

The skill may store a history of supplemental content the skill has already output to a user, and the user's response to the output supplemental content, and use such history to influence the future output of supplemental content. For example, the skill may determine the skill has already output first supplemental content to the user and, in conjunction with performing an action responsive to a current user input, determine to output second supplemental content to the user.

The present disclosure provides, among other things, techniques for a system to allow a skill to customize output of supplemental content to a user, without the skill knowing the identity of the user. The system of the present disclosure is configured to generate a personalization identifier specific to a user and skill. For example, in at least some embodiments, the system may generate a personalization identifier as a one-way hash of the user's profile identifier, the skill's identifier, and a timestamp (e.g., representing a time when the personalization identifier is generated). The system may generate a different personalization identifier for each skill enabled by the user (i.e., indicated by the user as being authorized to process with respect to user inputs of the user). For example, the system may generate a first personalization identifier using the user's profile identifier, a first skill's identifier, and a first timestamp; a second personalization identifier using the user's profile identifier, a second skill's identifier, and a second timestamp; etc.
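
To make the hashing step concrete, below is a minimal sketch in Python. The function name, the delimiter, and the choice of SHA-256 are illustrative assumptions; the disclosure specifies only that a one-way hash of the user's profile identifier, the skill's identifier, and a timestamp is used.

```python
import hashlib

def generate_personalization_id(user_profile_id: str, skill_id: str, timestamp: str) -> str:
    """One-way hash of a user profile identifier, a skill identifier, and a timestamp.

    Because the hash is one-way and its inputs include data never shared with
    the skill, the skill cannot recover the user profile identifier from it.
    """
    material = f"{user_profile_id}|{skill_id}|{timestamp}".encode("utf-8")  # assumed input encoding
    return hashlib.sha256(material).hexdigest()

# Different skills receive different identifiers for the same user:
pid_music = generate_personalization_id("user-123", "skill-music", "2024-01-01T00:00:00Z")
pid_weather = generate_personalization_id("user-123", "skill-weather", "2024-01-01T00:00:00Z")
assert pid_music != pid_weather
```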

According to the present disclosure, when the system invokes a skill with respect to a user input, the system identifies (or generates if one has not been generated) a personalization identifier generated using at least the user profile identifier of the current user, the skill's identifier, and a timestamp. The system sends the personalization identifier to the skill. The skill may store a history of supplemental content the skill output using the personalization identifier, and the user's response to the output supplemental content, and use such history to customize the future output of supplemental content using the personalization identifier. However, based on the personalization identifier being generated using data not shared with the skill (i.e., the user's profile identifier and the timestamp), the skill is unable to know the exact identity of the user.

The teachings of the present disclosure provide an improved user experience by enabling a skill to customize output of supplemental content to a user. The teachings of the present disclosure also increase user privacy by configuring such customized output using a personalization identifier that is unique to a user but that cannot be used by the skill to identify the specific user.

In at least some embodiments, a user may provide a user input requesting one, some, or all of the user's personalization identifiers be reset. In response to such a user input, the system may generate one or more new personalization identifiers, where each new personalization identifier is generated using the user's profile identifier, a skill identifier, and a timestamp corresponding to receipt of the user input requesting the reset. Thereafter, the system may send the new personalization identifier to a skill, and not the previously used personalization identifier.
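
The reset flow can be sketched as follows, reusing the hypothetical generate_personalization_id() from the sketch above; the function and data shapes are assumptions, not the disclosed implementation. The key point is that the reset timestamp changes the hash input, so every regenerated identifier differs from its predecessor.

```python
from datetime import datetime, timezone

def reset_personalization_ids(user_profile_id: str, enabled_skill_ids: list) -> dict:
    """Regenerate one identifier per enabled skill, keyed on the reset time.

    The old identifiers are simply no longer sent to skills; skills are not
    told that a reset occurred.
    """
    reset_time = datetime.now(timezone.utc).isoformat()  # timestamp of the reset request
    return {
        skill_id: generate_personalization_id(user_profile_id, skill_id, reset_time)
        for skill_id in enabled_skill_ids
    }
```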

The system may be configured to not indicate, to a skill, when a new personalization identifier is generated. From a skill's perspective, all the skill knows is that the skill has received a personalization identifier the skill has not received before. Thus, although original and new personalization identifiers may be generated using a same user profile identifier and a same skill identifier, it is difficult, if not impossible, for the skill (corresponding to the skill identifier used to generate both the original and new personalization identifiers) to associate the original and new personalization identifiers.

The foregoing provides an improved user experience that results in increased user privacy, as personalizing recommendations based on the user's behavior may not be tied to permanent user-identifying data (e.g., a user profile identifier, user demographic and other user profile data, or a device identifier, such as a device serial number).

At least some embodiments of the present disclosure relate to processing techniques that allow the system to generate a personalization identifier while introducing minimal user-perceived latency. For example, when the system receives a user input, the system may generate the personalization identifier at least partially in parallel to performing ASR processing and/or NLU processing of the user input. By parallelizing such processing, the system is able to provide the personalization identifier when the skill requests it, rather than starting generation of the personalization identifier when the skill requests the same.

A system according to the present disclosure may be configured to incorporate user permissions and may only perform activities disclosed herein if approved by a user. As such, the systems, devices, components, and techniques described herein would typically be configured to restrict processing where appropriate and only process user data in a manner that ensures compliance with all appropriate laws, regulations, standards, and the like. The system and techniques can be implemented on a geographic basis to ensure compliance with laws in various jurisdictions and entities in which the components of the system and/or user are located.

FIG. 1 shows a system 100 configured to generate and use personalization identifiers for customizing output of supplemental content. Although the figures and discussion of the present disclosure illustrate certain steps in a particular order, the steps described may be performed in a different order (as well as certain steps removed or added) without departing from the present disclosure.

As shown in FIG. 1, the system 100 may include a device 110 (local to a user 105) in communication with a system 120 across a network(s) 199. The network 199 may include the Internet and/or any other wide- or local-area network, and may include wired, wireless, and/or cellular network hardware.

In some examples, the user 105 may speak an input, and the device 110 may capture audio 107 representing the spoken input. In other examples, the user 105 may provide another type of input (e.g., typed natural language input, selection of a button, selection of one or more displayed graphical interface elements, performance of a gesture, etc.). The device 110 may send (step 1) audio data (or another type of input data, such as image data, text data, etc.) corresponding to the user input to the system 120 for processing. An orchestrator component 130, of the system 120, may receive the input data from the device 110. The orchestrator component 130 may be configured to coordinate data transmissions between components of the system 120.

Upon receiving the input data, the orchestrator component 130 may call (step 2) a user recognition component 195, of the system 120, to determine an identity of the user 105. The user recognition component 195 may recognize the user 105 using various data and one or more user recognition techniques. The user recognition component 195 may take as input audio data representing the user input, when the user input is a spoken user input. The user recognition component 195 may perform user recognition by comparing speech characteristics, in the audio data, to stored speech characteristics of users associated with the device 110 (e.g., users having user profiles associated with and/or indicating the device 110). The user recognition component 195 may additionally or alternatively perform user recognition by comparing biometric data (e.g., fingerprint data, iris data, retina data, etc.), received by the system 120 in correlation with the user input, to stored biometric data of users associated with the device 110. The user recognition component 195 may additionally or alternatively perform user recognition by comparing image data (e.g., including a representation of at least a feature of the user 105), received by the system 120 in correlation with the user input, with stored image data including representations of features of different users associated with the device 110. The user recognition component 195 may perform other or additional user recognition processes, including those known in the art.

The user recognition component 195 determines whether the user input originated from a particular user. For example, the user recognition component 195 may determine a first value representing a likelihood that the user input originated from a first user associated with the device 110, a second value representing a likelihood that the user input originated from a second user associated with the device 110, etc. The user recognition component 195 may also determine an overall confidence regarding the accuracy of user recognition processing. The user recognition component 195 may send (step 3), to the orchestrator component 130, a single user profile identifier corresponding to the most likely user that originated the user input, or multiple user profile identifiers (e.g., in the form of an N-best list) with respective values representing likelihoods of respective users originating the user input.

In some embodiments, the device 110 may be configured with a user recognition component 695 (illustrated in and described with respect to FIG. 6). In such embodiments, the user recognition component 695 may process in a similar manner to that described above with respect to the user recognition component 195. In such instances, the device 110 may send the user profile identifier(s), output by the user recognition component 695, to the orchestrator component 130 in conjunction with sending the input data at step 1 (corresponding to the user input) to the orchestrator component 130.

The orchestrator component 130 calls (step 4a) a context aggregation component 135, of the system 120, to identify and/or generate one or more personalization identifiers with respect to the user identifier (or top-scoring user identifier) determined by the user recognition component 195/695. When calling the context aggregation component 135, the orchestrator component 130 may send, to the context aggregation component 135, the user identifier (or top-scoring user identifier) determined by the user recognition component 195/695. Additionally or alternatively (e.g., in the situation where the user recognition component 195/695 is not able to determine a user profile identifier with at least a threshold confidence), the orchestrator component 130 may send, to the context aggregation component 135, a device identifier corresponding to the device 110. By configuring the orchestrator component 130 to call (step 4a) the context aggregation component 135 upon receiving the input data (at step 1) and/or the user identifier(s) from the user recognition component 195, a personalization identifier for a skill 125, that will process with respect to the user input, may be generated prior to the skill 125 needing the personalization identifier, thereby reducing user-perceived latency due to processing to generate the personalization identifier.

In response to receiving the call from the orchestrator component 130 at step 4a, the context aggregation component 135 may determine various user-specific data from one or more storages. In at least some embodiments, the context aggregation component 135 may query (step 5) a profile storage 170 for the various user-specific data.

The profile storage 170 may include a variety of data related to individual users, groups of users, devices, etc. that interact with the system 120. As used herein, a “profile” refers to a set of data associated with a user, group of users, device, etc. The data of a profile may include preferences specific to the user, group of users, device, etc.; user demographic information; input and output capabilities of one or more devices; internet connectivity data; subscription data; skill enablement data; and/or other data.

The profile storage 170 may include one or more user profiles. Each user profile may be associated with a different user profile identifier. Each user profile may include various user identifying data (e.g., name, gender, address, language(s), etc.). Each user profile may also include preferences of the user. Each user profile may include one or more device identifiers (e.g., device serial numbers), each representing a respective device registered to the user. Each user profile may include skill identifiers of skills that the user has enabled. When a user enables a skill, the user is providing the system 120 with permission to allow the skill to execute with respect to the user's user inputs. Each user profile may include resources (e.g., contacts of a contact list, names of song playlists, etc.) of the user corresponding to the user profile.

The profile storage 170 may include one or more group profiles. Each group profile may be associated with a different group profile identifier. A group profile may be specific to a group of users. That is, a group profile may be associated with two or more individual user profiles. For example, a group profile may be a household profile that is associated with user profiles associated with multiple users of a single household. In another example, a group profile may be a vehicle profile associated with multiple users of the vehicle. Various other types of group profiles are within the scope of the present disclosure. The present disclosure envisions any type of group profile corresponding to an environment and two or more users. A group profile may be associated with (or include) one or more device profiles corresponding to one or more devices associated with the group profile.

The profile storage 170 may include one or more device profiles. Each device profile may be associated with a different device identifier (e.g., device serial number). A device profile may include various device identifying data, input/output characteristics, networking characteristics, etc. A device profile may also include one or more user profile identifiers, corresponding to one or more user profiles associated with the device profile. For example, a household device's profile may include the user profile identifiers of users of the household.

The context aggregation component 135 may query (step 5) the profile storage 170 for user-specific data that may be used to generate one or more personalization identifiers. The context aggregation component 135 may query the profile storage 170 for skill identifiers corresponding to skills indicated as being enabled in a user profile associated with the user profile identifier received from the orchestrator component 130 (at step 4a). The context aggregation component 135 may additionally or alternatively query the profile storage 170 for skill identifiers corresponding to skills indicated as being enabled in user profiles associated with a device profile corresponding to the device identifier (of the device 110) received from the orchestrator component 130 (at step 4a).

The context aggregation component 135 may additionally or alternatively query the profile storage 170 for one or more user preferences associated with the user profile identifier and/or device identifier, where such a user preference indicates some parameter for controlling output of supplemental content to the user 105. For example, a user preference may indicate when (e.g., time(s) of day, day(s) of week, etc.) supplemental content may or should not be output to the user 105 and/or using the device 110. For further example, a user preference may indicate how (e.g., as synthesized speech, displayed text, etc.) supplemental content may or should not be output to the user 105 and/or using the device 110. In another example, a user preference may indicate a particular skill is not to output supplemental content to the user 105 and/or using the device 110. For further example, a user preference may indicate that a particular skill is permitted to output supplemental content to the user 105 and/or using the device 110, but that a personalization identifier is not to be sent to the skill (thereby preventing the skill from tracking behavior of the user 105 with respect to output supplemental content). Other user preferences for controlling output of supplemental content exist, and are within the scope of the present disclosure.

The context aggregation component 135 may additionally or alternatively query the profile storage 170 for a timestamp representing when the user 105 most recently requested a reset of the user's personalization identifier(s).

The context aggregation component 135 may additionally or alternatively query the profile storage 170 for location data (e.g., country, state, city, etc.) associated with the user profile identifier and/or device identifier received from the orchestrator component 130 (at step 4a). Such location data may be used for determining whether a personalization identifier is to be generated, as a location (e.g., country, state, city, etc.) may have a law or other rule that prohibits tracking of user behavior with respect to output supplemental content. When the location data (received from the profile storage 170) corresponds to a location having such a law or other rule, the system 120 may be configured to not generate personalization identifiers, as described further below.

While FIG. 1 shows the context aggregation component 135 querying the profile storage 170 for various user-specific data, the system 120 may alternatively be configured such that the context aggregation component 135 queries different storages for different user-specific data (e.g., a first storage for enabled skill identifiers, a second storage for user preferences, a third storage for a timestamp representing when the user 105 most recently requested a reset of the user's personalization identifier(s), a fourth storage for location data, etc.).

The context aggregation component 135 may send (step 6), to a personalization identifier component 140, the user profile identifier (which the context aggregation component 135 received from the orchestrator component 130 at step 4a), the device identifier (which the context aggregation component 135 received from the orchestrator component 130 at step 4a), the enabled skill identifiers, the user preferences, the timestamp representing when the user 105 most recently requested a reset of the user's personalization identifier(s), the location data, and any other user-specific data determined by the context aggregation component 135.

In response to receiving the user profile identifier, the device identifier, and the user-specific data determined by the context aggregation component 135, the personalization identifier component 140 may identify and/or generate one or more personalization identifiers. The personalization identifier component 140 may communicate with one or more storages (e.g., one or more lookup tables) in order to determine additional data (for generating personalization identifiers) based on the data received from the context aggregation component 135.

As described in detail herein with respect to FIG. 5, the system 120 may receive a user input to generate new personalization identifiers for the user 105. As a result of the user input, the system 120 may cease sending, to skills, personalization identifiers that were generated for the user previous to receipt of the user input. The personalization identifier component 140 may communicate with a personalization identifier cache 145 that stores the presently usable personalization identifiers for various users of the system 120.

As illustrated in FIG. 2, a personalization identifier may be associated with the various data used to generate it. For example, a personalization identifier (represented as [Personalization identifier 1] in FIG. 2) may be generated using (and thus associated with, in the personalization identifier cache 145) a user profile identifier (represented as [User profile identifier 1] in FIG. 2), a skill identifier (represented as [Skill identifier 1] in FIG. 2), and a timestamp (represented as [Timestamp 1] in FIG. 2). In situations where the user recognition component 195/695 is unable to determine an identity of the user 105, for example, a personalization identifier (represented as [Personalization identifier 2] in FIG. 2) may be generated using (and thus associated with, in the personalization identifier cache 145) a device identifier (represented as [Device identifier 2] in FIG. 2), a skill identifier (represented as [Skill identifier 2] in FIG. 2), and a timestamp (represented as [Timestamp 2] in FIG. 2). As yet a further example, a personalization identifier (represented as [Personalization identifier n] in FIG. 2) may be generated using (and thus associated with, in the personalization identifier cache 145) a user profile identifier (represented as [User profile identifier n] in FIG. 2), a device identifier (represented as [Device identifier n] in FIG. 2), a skill identifier (represented as [Skill identifier n] in FIG. 2), and a timestamp (represented as [Timestamp n] in FIG. 2).
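
A plausible shape for one row of the personalization identifier cache 145, mirroring the FIG. 2 associations, is sketched below; the class and field names are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class PersonalizationIdCacheEntry:
    """One association in the personalization identifier cache (per FIG. 2)."""
    personalization_id: str
    skill_id: str
    timestamp: str
    user_profile_id: Optional[str] = None  # absent when the user was not recognized
    device_id: Optional[str] = None        # absent when keyed on the user profile alone
```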

In some embodiments, a personalization identifier may be generated using user profile data in addition to or instead of using a user profile identifier. Examples of such user profile data that may be used include, but are not limited to, user age, user gender, user geographic location, and other user demographic information. It will thus be appreciated that anytime the present disclosure mentions a personalization identifier being generated using a user profile identifier, the personalization identifier may be generated using various user profile data in addition to or instead of the user profile identifier.

In some embodiments, a personalization identifier may be generated using device profile data in addition to or instead of using a device identifier. Examples of such device profile data that may be used include, but are not limited to, device name, device geographic location, device input capabilities, device output capabilities, and other device information. It will thus be appreciated that anytime the present disclosure mentions a personalization identifier being generated using a device identifier, the personalization identifier may be generated using various device profile data in addition to or instead of the device identifier.

In view of the foregoing, it will be appreciated that a personalization identifier may be generated using user-specific data (e.g., a user profile identifier and/or user profile data), device-specific data (e.g., a device identifier and/or device profile data), a time of day (e.g., a timestamp), and a skill identifier.

The personalization identifier component 140 may also communicate with a skill configuration storage 155 that stores various skill-based data. As illustrated in FIG. 3, a skill developer identifier may be associated with one or more skill identifiers, where each skill identifier represents a skill configurable by a skill developer corresponding to the skill developer identifier. Each skill developer identifier, and by extension each skill identifier associated with the skill developer identifier, may be associated with data representing whether a skill of the skill developer is permitted to receive a personalization identifier. In addition, each skill developer identifier, and by extension each skill identifier associated with the skill developer identifier, may be associated with data representing whether a skill of the skill developer is permitted to share a received personalization identifier (e.g., with another skill corresponding to the same skill developer, with one or more supplemental content sources, etc.).
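
The permission data of FIG. 3 might be represented per skill as in the following sketch; the field names are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class SkillPermissions:
    """Per-skill permission data in the skill configuration storage 155 (per FIG. 3)."""
    skill_developer_id: str
    skill_id: str
    may_receive_personalization_id: bool  # may the skill receive an identifier at all?
    may_share_personalization_id: bool    # may the skill share a received identifier?
```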

In some embodiments, an entity (e.g., a business entity) may have various skill developers that create various skills for the entity. The skill configuration storage 155 may store data such that an identifier of the entity may be associated with each skill generated by a skill developer of the entity. In some embodiments, when generating a personalization identifier for a skill of the entity, the personalization identifier may be generated using the entity identifier instead of the skill identifier. As such, a user may have a single personalization identifier for all skills of an entity.

The personalization identifier component 140 may additionally communicate with a personalization identifier storage 165. The personalization identifier storage 165 may store similar data to the personalization identifier cache 145. However, whereas the personalization identifier cache 145 may store personalization identifiers generated since receipt of a most recent user input requesting new personalization identifiers be generated, the personalization identifier storage 165 may store personalization identifiers generated prior to and after receipt of such a user input. Whereas the personalization identifier cache 145 may be configured to store data for rapid recall (as detailed herein below), the personalization identifier storage 165 may store data that may be used to determine what supplemental content was sent to what users (and optionally using what devices). Such data may enable the system 120 to evaluate whether a skill is spamming a user of the system 120.

The personalization identifier component 140 may perform one or more checks prior to generating a new personalization identifier. Referring to FIG. 4, the personalization identifier component 140 receives (step 402, part of step 6 in FIG. 1) a skill identifier from the context aggregation component 135. The personalization identifier component 140 also receives (step 404, part of step 6 in FIG. 1) a user profile identifier and/or a device identifier from the context aggregation component 135. The personalization identifier component 140 may determine (step 406, step 7 in FIG. 1) whether the personalization identifier cache 145 already includes a personalization identifier associated with (i.e., generated using) the skill identifier and the user profile identifier and/or device identifier. If the personalization identifier component 140 determines the personalization identifier cache 145 includes a personalization identifier associated with the skill identifier and the user profile identifier and/or the device identifier, the personalization identifier component 140 may cease (step 408) processing with respect to the skill identifier, as there is already an applicable personalization identifier generated and ready to be sent to the skill corresponding to the skill identifier.

Conversely, if the personalization identifier component 140 determines the personalization identifier cache 145 does not include a personalization identifier associated with (i.e., generated using) the skill identifier and the user profile identifier and/or the device identifier, the personalization identifier component 140 may query (step 410, step 8 in FIG. 1) the skill configuration storage 155 for permission data associated with the skill identifier. For example, the permission data may indicate whether the skill is permitted to receive a personalization identifier, and/or whether the skill is permitted to share a received personalization identifier (e.g., with another skill corresponding to the same skill developer, with one or more supplemental content sources, etc.).

The personalization identifier component 140 determines (step 412) whether the permission data indicates a personalization identifier should not be generated for the skill. If the personalization identifier component 140 determines the permission data indicates a personalization identifier should not be generated for the skill, the personalization identifier component 140 may cease (step 414) processing with respect to the skill identifier.

Conversely, if the personalization identifier component 140 determines the permission data indicates a personalization identifier may be generated for the skill, the personalization identifier component 140 may generate (step 416) a personalization identifier using the skill identifier and the user profile identifier and/or the device identifier. In situations where the user recognition component 195/695 is unable to determine an identity of the user 105, the personalization identifier component 140 may receive the device identifier, but not a user profile identifier, from the context aggregation component 135. In such situations, the personalization identifier component 140 may generate the personalization identifier using the skill identifier, the device identifier, and a timestamp corresponding to generation of the personalization identifier. In instances where the personalization identifier component 140 receives the user profile identifier and the device identifier from the context aggregation component 135, the personalization identifier component 140 may generate the personalization identifier using: the skill identifier, the user profile identifier, and the timestamp; or the skill identifier, the user profile identifier, the device identifier, and the timestamp. It will be appreciated that a personalization identifier, generated using a device identifier but not a user profile identifier, may be used by a skill to customize the output of supplemental content for supplemental content output using the device 110 (regardless of the user that provided the corresponding user input). It will further be appreciated that a personalization identifier, generated using a user profile identifier but not a device identifier, may be used by a skill to customize the output of supplemental content for the user, regardless of which device the user is interacting with. It will also be appreciated that a personalization identifier, generated using a user profile identifier and a device identifier, may be used by a skill to customize the output of supplemental content for the user based on the device the user is interacting with. Whether the personalization identifier component 140 is configured to generate the personalization identifier using a user profile identifier, a device identifier, or both a user profile identifier and a device identifier may be configurable and may be based on the data input to the personalization identifier component 140 in any given instance.
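
The FIG. 4 checks can be summarized in a short sketch. The cache and permission lookups are stand-ins (plain dictionaries) for the personalization identifier cache 145 and the skill configuration storage 155, and the hash helper mirrors the earlier sketch; none of this is the disclosed implementation.

```python
import hashlib
from datetime import datetime, timezone
from typing import Dict, Optional, Tuple

def _hash_id(subject: str, skill_id: str, timestamp: str) -> str:
    return hashlib.sha256(f"{subject}|{skill_id}|{timestamp}".encode()).hexdigest()

def get_or_create_personalization_id(
    cache: Dict[Tuple, str],            # stand-in for the personalization identifier cache 145
    permitted_skills: Dict[str, bool],  # stand-in for permission data in the storage 155
    skill_id: str,
    user_profile_id: Optional[str] = None,
    device_id: Optional[str] = None,
) -> Optional[str]:
    key = (user_profile_id, device_id, skill_id)
    if key in cache:                               # steps 406/408: identifier already exists
        return cache[key]
    if not permitted_skills.get(skill_id, False):  # steps 410-414: permission data forbids it
        return None
    subject = "|".join(x for x in (user_profile_id, device_id) if x)  # user and/or device
    timestamp = datetime.now(timezone.utc).isoformat()
    pid = _hash_id(subject, skill_id, timestamp)   # step 416: generate via one-way hash
    cache[key] = pid                               # step 418: store for rapid recall
    return pid
```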

The personalization identifier component 140 may use various techniques to generate the personalization identifier. In some embodiments, the personalization identifier component 140 may input the skill identifier, the user profile identifier and/or the device identifier, and a timestamp into a one-way hash function to generate a personalization identifier having, for example, an 8-4-4-4-12 format (i.e., a XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX format, where each “X” represents a number or letter generated using the one-way hash function).
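
One way to arrive at the 8-4-4-4-12 layout is to truncate a hex digest to 32 characters and insert dashes; the truncation and the SHA-256 choice are assumptions made for this sketch.

```python
import hashlib

def format_personalization_id(subject_id: str, skill_id: str, timestamp: str) -> str:
    """Hash the inputs, then lay out 32 hex characters as 8-4-4-4-12."""
    digest = hashlib.sha256(f"{subject_id}|{skill_id}|{timestamp}".encode()).hexdigest()
    h = digest[:32]  # assumption: keep the first 32 of the 64 hex characters
    return f"{h[0:8]}-{h[8:12]}-{h[12:16]}-{h[16:20]}-{h[20:32]}"

# Produces a UUID-shaped string, e.g. "a1b2c3d4-e5f6-0718-293a-4b5c6d7e8f90",
# but derived deterministically via the one-way hash rather than generated randomly.
```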

After generating the personalization identifier, the personalization identifier component 140 may store (step 418, and steps 7 and 9 in FIG. 1) the personalization identifier in the personalization identifier cache 145 and the personalization identifier storage 165.

The personalization identifier component 140 may perform the processing of FIG. 4 with respect to each skill identifier received from the context aggregation component 135 at step 6 in FIG. 1.

In some embodiments, the system 120 may communicate with one or more skills having a rating satisfying (e.g., meeting or exceeding) a condition (e.g., a threshold rating). The rating of a skill may be generated based on implicit and/or explicit user feedback provided by various users of the system 120 with respect to the skill. With user permission, the personalization identifier component 140 may perform the processing of FIG. 4 with respect to the one or more highly-rated skills even though one or more of the highly-rated skills may not be indicated as enabled in the user profile of the user 105 and/or the device profile of the device 110.

Referring again to FIG. 1, in addition to calling the context aggregation component 135 at step 4a, the orchestrator component 130 may send (step 4b) the input audio data (received at step 1) to an ASR component 150. The ASR component 150 transcribes the input audio data into one or more ASR hypotheses. An ASR hypothesis may be configured as a textual interpretation of the speech of the spoken input, or may be configured in another manner, such as one or more tokens. Each ASR hypothesis may represent a different likely interpretation of the speech in the input audio data.

The ASR component 150 may interpret speech in the input audio data based on the similarity between the input audio data and language models. For example, the ASR component 150 may compare the input audio data with models for sounds (e.g., subword units or phonemes) and sequences of sounds to identify words that match the sequence of sounds in the spoken input. Alternatively, the ASR component 150 may use a finite state transducer (FST) to implement the language model functions.

When the ASR component 150 generates more than one ASR hypothesis for a single spoken input, each ASR hypothesis may be assigned a score (e.g., probability score, confidence score, etc.) representing a likelihood that the corresponding ASR hypothesis matches the spoken input (e.g., representing a likelihood that a particular set of words matches those in the spoken input). The score may be based on a number of factors including, for example, the similarity of the sound in the spoken input to models for language sounds (e.g., an acoustic model), and the likelihood that a particular word, which matches the sounds, would be included in the sentence at the specific location (e.g., using a language model). Based on the considered factors and the assigned confidence score, the ASR component 150 may output an ASR hypothesis that most likely matches the spoken input, or may output multiple ASR hypotheses in the form of a lattice or an N-best list, with each ASR hypothesis corresponding to a respective score.

The ASR component 150 may send (step 10) the one or more ASR hypotheses to the orchestrator component 130, which may send (step 11) the one or more ASR hypotheses to a NLU component 160.

The NLU component 160 processes the one or more ASR hypotheses to determine one or more NLU hypotheses. The NLU component 160 may perform intent classification (IC) processing on an ASR hypothesis to determine an intent of the user input. An intent corresponds to an action to be performed that is responsive to the user input. To perform IC processing, the NLU component 160 may communicate with a database of words linked to intents. For example, a music intent database may link words and phrases such as “quiet,” “volume off,” and “mute” to a <Mute> intent. The NLU component 160 identifies intents by comparing words and phrases in the ASR hypothesis to the words and phrases in an intents database. In some embodiments, the NLU component 160 may communicate with multiple intents databases, where each intents database corresponds to one or more intents associated with a particular domain or skill.

For example, IC processing of the user input “play my workout playlist” may determine an intent of <PlayMusic>. For further example, IC processing of the user input “call mom” may determine an intent of <Call>. In another example, IC processing of the user input “call mom using video” may determine an intent of <VideoCall>. In yet another example, IC processing of the user input “what is today's weather” may determine an intent of <OutputWeather>.

As used herein, a “domain” refers to a collection of related functionality. A non-limiting list of domains includes a smart home domain (corresponding to smart home functionality), a music domain (corresponding to music functionality), a video domain (corresponding to video functionality), a weather domain (corresponding to weather functionality), a communications domain (corresponding to one- or two-way communications functionality), and a shopping domain (corresponding to shopping functionality).

A group of skills, configured to provide related functionality, may be associated with a domain. For example, one or more music skills may be associated with a music domain.

The NLU component 160 may also perform named entity recognition (NER) processing on the ASR hypothesis to determine one or more entity names that are mentioned in the user input and that may be needed for post-NLU processing. For example, NER processing of the user input “play [song name]” may determine an entity type of “SongName” and an entity name of “[song name].” For further example, NER processing of the user input “call mom” may determine an entity type of “Recipient” and an entity name of “mom.” In another example, NER processing of the user input “what is today's weather” may determine an entity type of “Date” and an entity name of “today.”

In at least some embodiments, the intents identifiable by the NLU component 160 may be linked to one or more grammar FSTs with entity types to be populated with entity names. Each entity type of a FST may correspond to a portion of an ASR hypothesis that the NLU component 160 believes corresponds to an entity name. For example, a FST corresponding to a <PlayMusic> intent may correspond to sentence structures such as “Play {Artist Name},” “Play {Album Name},” “Play {Song name},” “Play {Song name} by {Artist Name},” etc.

For example, the NLU component 160 may perform NER processing to identify words in an ASR hypothesis as subject, object, verb, preposition, etc. based on grammar rules and/or models. Then, the NLU component 160 may perform IC processing using the identified verb to identify an intent. Thereafter, the NLU component 160 may again perform NER processing to determine a FST associated with the identified intent. For example, a FST for a <PlayMusic> intent may specify a list of entity types applicable to play the identified “object” and any object modifier (e.g., a prepositional phrase), such as {Artist Name}, {Album Name}, {Song name}, etc.

The NLU component 160 may generate one or more NLU hypotheses, where each NLU hypothesis includes an intent and optionally one or more entity types and corresponding entity names. In some embodiments, a NLU hypothesis may be associated with a score representing a confidence of the NLU processing performed to determine the NLU hypothesis with which the score is associated.
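
As a sketch, an NLU hypothesis of the kind described above might be represented as follows (the class and field names are illustrative, not the system's actual data model):

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class NLUHypothesis:
    """An intent plus optional entity type/name pairs and a confidence score."""
    intent: str                                              # e.g., "<PlayMusic>"
    entities: Dict[str, str] = field(default_factory=dict)   # entity type -> entity name
    score: float = 0.0                                       # confidence of the NLU processing

hyp = NLUHypothesis(intent="<PlayMusic>", entities={"SongName": "[song name]"}, score=0.92)
```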

As described above, the system 120 may perform speech processing using two different components (e.g., the ASR component 150 and the NLU component 160). In at least some embodiments, the system 120 may implement a spoken language understanding (SLU) component configured to process input audio data to determine one or more NLU hypotheses.

The SLU component may be functionally equivalent to a combination of the ASR component 150 and the NLU component 160. Yet, the SLU component may process input audio data and directly determine the one or more NLU hypotheses, without an intermediate step of generating one or more ASR hypotheses. As such, the SLU component may take input audio data representing speech and attempt to make a semantic interpretation of the speech. That is, the SLU component may determine a meaning associated with the speech and then implement that meaning. For example, the SLU component may interpret input audio data representing speech from the user 105 in order to derive a desired action. The SLU component may output a most likely NLU hypothesis, or multiple NLU hypotheses associated with respective confidence or other scores (such as probability scores, etc.).

The NLU component 160 (or SLU component) may send (step 12) the one or more NLU hypotheses to the orchestrator component 130, which may send (step 13) the one or more NLU hypotheses to a domain selection component 172. The domain selection component 172 is configured to determine which domain the current user input most likely corresponds to. In a simple example, the domain selection component 172 may determine the domain of the current user input based on the intent in the received (or top-scoring received) NLU hypothesis. In some examples, the domain selection component 172 may determine the domain of the current user input based on the intent and an entity type(s) represented in the received (or top-scoring received) NLU hypothesis. For example, if the domain selection component 172 determines the received (or top-scoring received) NLU hypothesis includes a <TurnOnLight> intent, the domain selection component 172 may determine the user input corresponds to a smart home domain. For further example, if the domain selection component 172 determines the received (or top-scoring received) NLU hypothesis includes a <Play> intent and a “song” entity type, the domain selection component 172 may determine the user input corresponds to a music domain. In another example, if the domain selection component 172 determines the received (or top-scoring received) NLU hypothesis includes a <Play> intent and a “video” entity type, the domain selection component 172 may determine the user input corresponds to a video domain. For further example, if the domain selection component 172 determines the received (or top-scoring received) NLU hypothesis includes an <OutputWeather> intent, the domain selection component 172 may determine the user input corresponds to a weather domain. In another example, if the domain selection component 172 determines the received (or top-scoring received) NLU hypothesis includes a <Call> intent, the domain selection component 172 may determine the user input corresponds to a communications domain. For further example, if the domain selection component 172 determines the received (or top-scoring received) NLU hypothesis includes a <Purchase> intent, the domain selection component 172 may determine the user input corresponds to a shopping domain.
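
The intent/entity-type examples above amount to a small rule table; a hypothetical sketch of such a mapping follows (the rules simply mirror the examples, and a deployed domain selection component could use learned models instead):

```python
from typing import Optional

# (intent, entity type) -> domain; None matches a hypothesis with no relevant entity type
DOMAIN_RULES = {
    ("<TurnOnLight>", None): "smart home",
    ("<Play>", "song"): "music",
    ("<Play>", "video"): "video",
    ("<OutputWeather>", None): "weather",
    ("<Call>", None): "communications",
    ("<Purchase>", None): "shopping",
}

def select_domain(intent: str, entity_type: Optional[str] = None) -> Optional[str]:
    """Return the domain for the (top-scoring) NLU hypothesis, if a rule matches."""
    return DOMAIN_RULES.get((intent, entity_type)) or DOMAIN_RULES.get((intent, None))
```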

The domain selection component 172 may send (step 14), to the orchestrator component 130, an indication of the domain determined by the domain selection component 172. The system 120 may include a different domain component for each domain (e.g., a single music component for the music domain, a single weather component for the weather domain, etc.). In response to receiving the indication of the domain from the domain selection component 172, the orchestrator component 130 may determine a domain component 175 corresponding to the indicated domain, and send (step 15) the one or more NLU hypotheses (output by the NLU component 160 and input to the domain selection component 172) to the domain component 175.

The domain component 175 may be associated with a group of skills configured to perform functionality corresponding to the domain. For example, a music domain component may be associated with a plurality of music skills configured to perform music-related processing, a smart home domain component may be associated with a plurality of smart home skills configured to perform smart-home related processing, etc.

The domain component 175 may be configured to select, from among the skills associated therewith, which skill is to process with respect to the current user input. For example, if a music domain receives a NLU hypothesis (or top-scoring NLU hypothesis) including a <Play> intent, an entity type of “song,” and an entity name of “[song name],” the music domain may determine which, of the skills associated therewith, is able to play the indicated song.

In some embodiments, the domain component 175 may select a skill based on one or more skill preferences of the user 105, as indicated in the user's profile in the profile storage 170. For example, a smart home domain may select a first smart home skill (to process with respect to the current user input) based on the user profile, of the user 105, indicating the user 105 prefers the first smart home skill.

In some embodiments, the domain component 175 may select a skill based on a subscription of the user 105, as indicated in the user's profile in the profile storage 170. For example, a podcast domain may select a first podcast skill (to process with respect to the current user input) based on the user profile, of the user 105, indicating the user 105 has purchased a subscription to the first podcast skill. Conversely, the podcast domain may not select a second podcast skill based on the user profile not including subscription information for the second podcast skill.

In some embodiments, the domain component 175 may select a skill based on skill rating. The rating of a skill may be generated based on implicit and/or explicit user feedback provided by various users of the system 120 with respect to the skill. In some embodiments, the domain component 175 may not select a skill having a skill rating failing to satisfy (e.g., failing to meet or exceed) a condition (e.g., a threshold skill rating).

Once the domain component 175 selects a skill 125 to process with respect to the current user input, the domain component 175 may send (step 16), to the orchestrator component 130, a request for the personalization identifier associated with the skill identifier, of the selected skill, and the user profile identifier (of the user 105) and/or the device identifier of the device 110. The request may include the skill identifier and the user profile identifier and/or the device identifier. The orchestrator component 130 may send (step 17) the request to the context aggregation component 135, which may send (step 18) the request to the personalization identifier component 140.

In response to receiving the request, the personalization identifier component 140 may query the personalization identifier cache 145, rather than the personalization identifier storage 165, for a personalization identifier associated with the skill identifier and the user profile identifier and/or the device identifier. The personalization identifier component 140 may query the personalization identifier cache 145 because the cache is configured to store personalization identifiers that may presently be sent to skills, whereas the personalization identifier storage 165 is additionally configured to store personalization identifiers that are no longer permitted to be sent to skills (e.g., due to a user requesting that personalization identifiers be reset). The personalization identifier component 140 may send (step 19) the personalization identifier to the context aggregation component 135. The context aggregation component 135 may send (step 20) the personalization identifier to the orchestrator component 130, which may send (step 21) the personalization identifier to the domain component 175.

In some embodiments, the domain component 175 may request the personalization identifier directly from the personalization identifier component 140, without going through the orchestrator component 130 and the context aggregation component 135.

In some embodiments, rather than sending the request (at step 16) after the domain component 175 has selected the skill 125, the domain component 175 may, prior to selecting the skill 125, send (at step 16) a request for a respective personalization identifier associated with each skill identifier, associated with the domain component 175, and the user profile identifier and/or the device identifier. For example, if the domain component 175 is associated with a first skill (corresponding to a first skill identifier) and a second skill (corresponding to a second skill identifier), the domain component 175 may send (at step 16) a request for (1) a first personalization identifier associated with the first skill identifier and the user profile identifier and/or the device identifier, and (2) a second personalization identifier associated with the second skill identifier and the user profile identifier and/or the device identifier. In response, the domain component 175 may receive, from the personalization identifier component 140, a list of two or more personalization identifiers, where each personalization identifier in the list is associated with a respective skill identifier.

By calling the context aggregation component 135 (at step 4a) and sending the input audio data to the ASR component 150 (at step 4b) simultaneously or nearly simultaneously (as shown in FIG. 1), the personalization identifier component 140 may be able to generate the personalization identifier prior to the domain component 175 outputting the request at step 16. Such parallelized processing of the context aggregation component 135 and the personalization identifier component 140 with the ASR component 150, the NLU component 160, the domain selection component 172, and the domain component 175, may result in reduced user-perceived latency.

After selecting the skill 125, the domain component 175 may send (step 22), to a skill invocation component 180, the skill identifier of the skill 125, the NLU hypothesis (including an intent and optionally one or more entity types with corresponding entity names) to be processed by the skill 125, and the personalization identifier associated with the skill identifier and received from the personalization identifier component 140. The skill invocation component 180 may be a component of the system 120 that acts as an interface between the system 120 and various skills. Upon receiving the data from the domain component 175 at step 22, the skill invocation component 180 may send (step 23) at least a portion of the data (e.g., the NLU hypothesis and the personalization identifier) to the skill 125 corresponding to the skill identifier received from the domain component 175.

The skill 125 may process the NLU hypothesis to determine first output data responsive to the user input (e.g., based on the intent and entity type(s) and entity name(s) represented in the NLU hypothesis). For example, if the NLU hypothesis includes a <Play> intent, an entity type of “song,” and an entity name of “[song name],” the skill 125 may determine the first output data to include a file identifier corresponding to audio data of the song. For further example, if the NLU hypothesis includes an <OutputWeather> intent, the skill 125 may determine the first output data to include natural language text or some other representation of weather information for a geographic location of the user 105/device 110. Other examples are possible, and within the knowledge of one skilled in the art.

Sending of the personalization identifier, to the skill 125, may not indicate, to the skill 125, that the skill 125 must output supplemental content. Rather, sending of the personalization identifier, to the skill 125, may simply indicate that the skill 125 may output supplemental content based on the personalization identifier if the skill 125 so chooses.

If the skill 125 determines supplemental content should be output, the skill 125 may determine the supplemental content using various sources. The skill 125 may store a repository of supplemental content. The skill 125 may select a supplemental content from the repository to output based on the received personalization identifier, one or more entity types and names in the received NLU hypothesis, and previously-selected and output supplemental content associated with the personalization identifier (if any). While the skill 125 may select the supplemental content based on the personalization identifier, and previously-selected and output supplemental content associated with the personalization identifier, how the personalization identifier is generated (as detailed herein above) may make it difficult, if not impossible, for the skill 125 to determine the identity of the user 105.

The skill 125 may additionally or alternatively communicate with a supplemental content component 185 of the system 120. The supplemental content component 185 may store a repository of supplemental content. The skill 125 may send a request for supplemental content to the supplemental content component 185, where the request includes the personalization identifier and one or more entity types and names represented in the NLU hypothesis received by the skill 125. The supplemental content component 185 may select supplemental content from the repository based on the personalization identifier, the one or more entity types and names, and previously selected and output supplemental content associated with the personalization identifier (if any). The supplemental content component 185 may send the selected supplemental content to the skill 125.

The skill 125 may additionally or alternatively communicate with a supplemental content system 190 implemented separate from the system 120. The supplemental content system 190 may store a repository of supplemental content. The skill 125 may send a request for supplemental content to the supplemental content system 190, where the request includes the personalization identifier and one or more entity types and names represented in the NLU hypothesis received by the skill 125. The supplemental content system 190 may select supplemental content from the repository based on the personalization identifier, the one or more entity types and names, and previously selected and output supplemental content associated with the personalization identifier (if any). The supplemental content system 190 may send the selected supplemental content to the skill 125. While the supplemental content system 190 may select the supplemental content based on the personalization identifier and previously selected and output supplemental content associated with the personalization identifier, how the personalization identifier is generated (as detailed herein above) may make it difficult, if not impossible, for the supplemental content system 190 to determine the identity of the user 105.

The skill 125 may send (step 24), to the skill invocation component 180, first output data corresponding to a response to the user input, and second output data corresponding to the selected supplemental content. In some instances, the skill 125 may determine supplemental content should not be output, and may simply send the first output data to the skill invocation component 180. The skill invocation component 180 may send (step 25) the first output data and the second output data (if such exists) to the orchestrator component 130. The orchestrator component 130 may then send (step 26) the first output data and the second output data (if such exists) to the device 110 for output (e.g., as audio and/or displayed text). In some embodiments, the orchestrator component 130 may invoke a TTS component of the system 120 to generate output audio data including synthesized speech corresponding to the first output data and/or the second output data.

In one method of synthesis called unit selection, the TTS component matches input data against a database of recorded speech. The TTS component selects matching units of recorded speech and concatenates the units together to form audio data. In another method of synthesis called parametric synthesis, the TTS component varies parameters such as frequency, volume, and noise to determine audio data including an artificial speech waveform. Parametric synthesis uses a computerized voice generator, sometimes called a vocoder.
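
As a rough illustration of unit selection (the unit database and its phonetic unit keys are assumptions of this sketch), concatenation might look like:

    import numpy as np

    def unit_selection_synthesize(units, unit_db):
        # Look up the recorded waveform for each phonetic unit and
        # concatenate them; production systems also minimize join costs
        # between adjacent units, which this sketch omits.
        return np.concatenate([unit_db[u] for u in units])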

As illustrated in FIG. 1 and described above, the domain component 175 may request and receive the personalization identifier from the personalization identifier component 140 (directly or indirectly via the context aggregation component 135). Alternatively, in some embodiments the skill invocation component 180 may request and receive the personalization identifier from the personalization identifier component 140 (directly or indirectly via the context aggregation component 135). As another alternative, in some embodiments the skill 125 may request and receive the personalization identifier from the personalization identifier component 140 (directly or indirectly via the context aggregation component 135). Regardless of which of the foregoing is implemented, user-perceived latency may be reduced by causing the personalization identifier component 140 to process at least partially in parallel to the ASR component 150, the NLU component 160, and/or the domain selection component 172.

Different components of the system 120 (other than the orchestrator component 130 as illustrated in FIG. 1) may be configured to send the call (of step 4a) to the context aggregation component 135. In some embodiments, the domain component 175 may call the context aggregation component 135 to generate (or determine already generated) personalization identifiers using the skill identifiers of the skills associated with the domain component 175, and the user profile identifier and/or the device identifier. In other embodiments, the domain component 175, the skill invocation component 180, or the skill 125 may call the context aggregation component 135 to generate (or determine an already generated) personalization identifier using the skill identifier of the skill 125 (selected by the domain component 175), and the user profile identifier and/or the device identifier.

In some embodiments, the personalization identifier component 140 may determine whether a new personalization identifier is to be generated based on the occurrence of one or more events. The personalization identifier component 140 may subscribe to an event bus of the system 120 (not illustrated) to receive events therefrom. For example, the personalization identifier component 140 may subscribe to receive skill enablement events, device registration events, and events requesting reset of a user's personalization identifiers.

As illustrated in FIG. 5, the personalization identifier component 140 may receive (step 502), from the event bus, event data corresponding to an event. In response to receiving the event data, the personalization identifier component 140 may determine (step 504) whether a new personalization identifier(s) is to be generated.

For example, if the event data indicates a user has enabled a skill, the personalization identifier component 140 may determine whether the personalization identifier cache 145 is presently storing a personalization identifier generated using (i.e., associated with) the user profile identifier of the user (as represented in the event data) and the skill identifier of the newly enabled skill (as represented in the event data). If the personalization identifier component 140 determines the personalization identifier cache 145 is presently storing the personalization identifier, the personalization identifier component 140 may cease (step 506) processing with respect to the event data. Conversely, if the personalization identifier component 140 determines the personalization identifier cache 145 is not presently storing the personalization identifier, the personalization identifier component 140 may generate (step 508) a new personalization identifier using the user profile identifier, the skill identifier, and a timestamp corresponding to a present time. The personalization identifier component 140 may then store (step 510) the personalization identifier in the personalization identifier cache 145 and the personalization identifier storage 165.
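
For example, a minimal sketch of the generation at step 508 follows, assuming SHA-256 as the one-way hash (the disclosure does not mandate a particular hash function):

    import hashlib
    import time

    def generate_personalization_identifier(user_profile_id, skill_id, timestamp=None):
        # Including the skill identifier yields different identifiers for
        # different skills; the one-way hash prevents recovery of the user
        # profile identifier from the result.
        if timestamp is None:
            timestamp = time.time()  # a timestamp corresponding to a present time
        material = f"{user_profile_id}|{skill_id}|{timestamp}".encode("utf-8")
        return hashlib.sha256(material).hexdigest()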

For further example, if the event data indicates a user has registered a device 110 (e.g., causing the device identifier of the device 110 to become associated with the user's profile identifier), the personalization identifier component 140 may determine whether the personalization identifier cache 145 is presently storing personalization identifier(s) generated using (i.e., associated with) the device identifier (as represented in the event data), and optionally the user profile identifier of the user. If the personalization identifier component 140 determines the personalization identifier cache 145 is presently storing the personalization identifier, the personalization identifier component 140 may cease (step 506) processing with respect to the event data. Conversely, if the personalization identifier component 140 determines the personalization identifier cache 145 is not presently storing a personalization identifier generated using the device identifier and optionally the user profile identifier, the personalization identifier component 140 may generate (step 508) a new personalization identifier for each enabled skill associated with the user profile identifier, where each new personalization identifier is generated using an enabled skill identifier, the device identifier (and optionally the user profile identifier), and a timestamp corresponding to a present time. The personalization identifier component 140 may then store (step 510) the personalization identifier(s) in the personalization identifier cache 145 and the personalization identifier storage 165.

In another example, if the event data indicates a user has provided a user input (e.g., a spoken input, selection of a virtual button displayed on a touchscreen of a device, etc.) requesting the user's personalization identifiers be reset, the personalization identifier component 140 may automatically determine, based on the event data corresponding to a “personalization identifier reset” event type, that the personalization identifier(s) of the user is to be newly generated. As a result, the personalization identifier component 140 may cause the personalization identifier cache 145 to delete therefrom presently stored personalization identifiers associated with (i.e., generated using) the user's profile identifier. The personalization identifier component 140 may not, however, cause the personalization identifiers (associated with the user's profile identifier) to be deleted from the personalization identifier storage 165. The personalization identifier component 140 may also generate (step 508) a new personalization identifier for each enabled skill associated with the user's profile identifier, where each new personalization identifier is generated using an enabled skill identifier, the user's profile identifier (and optionally a device identifier associated with the user's profile identifier), and a timestamp corresponding to a present time. The personalization identifier component 140 may then store (step 510) the newly generated personalization identifier(s) in the personalization identifier cache 145 and the personalization identifier storage 165.
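
A minimal sketch of this reset handling, reusing the generation sketch above and assuming hypothetical cache, storage, and profile interfaces, follows:

    import time

    def handle_reset_event(event, cache, storage, profiles):
        user_profile_id = event["user_profile_id"]
        # Evict cached identifiers for this profile; entries in the
        # persistent storage 165 are intentionally left in place.
        cache.delete_for_profile(user_profile_id)
        now = time.time()
        for skill_id in profiles.enabled_skill_ids(user_profile_id):
            pid = generate_personalization_identifier(user_profile_id, skill_id, now)
            cache.put(user_profile_id, skill_id, pid)
            storage.put(user_profile_id, skill_id, pid)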

The foregoing describes illustrative components and processing of the system 120. The following describes illustrative components and processing of the device 110. As illustrated in FIG. 6, in at least some embodiments the system 120 may receive audio data 611 from the device 110, to recognize speech in the received audio data 611, and to perform functions in response to the recognized speech. In at least some embodiments, these functions involve sending directives (e.g., commands) from the system 120 to the device 110 to cause the device 110 to perform an action, such as output synthesized speech (responsive to the user input) via a loudspeaker(s), and/or control one or more secondary devices by sending control commands to the one or more secondary devices.

Thus, when the device 110 is able to communicate with the system 120 over the network(s) 199, some or all of the functions capable of being performed by the system 120 may be performed by sending one or more directives over the network(s) 199 to the device 110, which, in turn, may process the directive(s) and perform one or more corresponding actions. For example, the system 120, using a remote directive that is included in response data (e.g., a remote response), may instruct the device 110 to output synthesized speech via a loudspeaker(s) of (or otherwise associated with) the device 110, output content (e.g., music) via the loudspeaker(s) of (or otherwise associated with) the device 110, display content on a display of (or otherwise associated with) the device 110, and/or send a directive to a secondary device (e.g., a directive to turn on a smart light in communication with the device 110). It will be appreciated that the system 120 may be configured to provide other functions in addition to those discussed herein, such as, without limitation, providing step-by-step directions for navigating from an origin location to a destination location, conducting an electronic commerce transaction on behalf of the user 105 as part of a shopping function, establishing a communication session (e.g., an audio or video call) between the user 105 and another user, and so on.

A microphone or array of microphones (of or otherwise associated with a device 110) may capture audio 107. The device 110 processes the audio data 611, representing the audio 107, to determine whether speech is detected. The device 110 may use various techniques to determine whether the audio data 611 includes speech. In some examples, the device 110 may apply voice activity detection (VAD) techniques. Such techniques may determine whether speech is present in the audio data 611 based on various quantitative aspects of the audio data 611, such as the spectral slope between one or more frames of the audio data 611, the energy levels of the audio data 611 in one or more spectral bands, the signal-to-noise ratios of the audio data 611 in one or more spectral bands, or other quantitative aspects. In other examples, the device 110 may implement a classifier configured to distinguish speech from background noise. The classifier may be implemented by techniques such as linear classifiers, support vector machines, and decision trees. In still other examples, the device 110 may apply hidden Markov model (HMM) or Gaussian mixture model (GMM) techniques to compare the audio data 611 to one or more acoustic models in storage, where the acoustic models may include models corresponding to speech, noise (e.g., environmental noise or background noise), or silence. Still other techniques may be used to determine whether speech is present in the audio data 611.
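
As one concrete (and deliberately simplistic) example, an energy-based VAD check over fixed-length frames might look like this; real systems typically combine several of the cues described above:

    import numpy as np

    def frame_has_speech(frame, energy_threshold=1e-3):
        # Mean squared amplitude as a crude frame-energy estimate.
        return float(np.mean(frame.astype(np.float64) ** 2)) > energy_threshold

    def speech_detected(audio, frame_len=320):
        # 320 samples = 20 ms at a 16 kHz sampling rate (an assumption).
        frames = (audio[i:i + frame_len]
                  for i in range(0, len(audio) - frame_len + 1, frame_len))
        return any(frame_has_speech(f) for f in frames)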

Once speech is detected in the audio data 611 representing the audio 107, the device 110 may determine if the speech is directed at the device 110/system 120. In at least some embodiments, such determination may be made using a wakeword detection component 620. The wakeword detection component 620 may be configured to detect various wakewords. In at least some examples, each wakeword may correspond to a name of a different digital assistant. An example wakeword/digital assistant name is “Alexa.”

Wakeword detection is typically performed without performing linguistic analysis, textual analysis, or semantic analysis. Instead, the audio data 611, representing the audio 107, is analyzed to determine if specific characteristics of the audio data 611 match preconfigured acoustic waveforms, audio signatures, or other data corresponding to a wakeword.

Thus, the wakeword detection component 620 may compare the audio data 611 to stored data to detect a wakeword. One approach for wakeword detection applies general large vocabulary continuous speech recognition (LVCSR) systems to decode audio signals, with wakeword searching being conducted in the resulting lattices or confusion networks. Another approach for wakeword detection builds HMMs for each wakeword and non-wakeword speech signals, respectively. The non-wakeword speech includes other spoken words, background noise, etc. There can be one or more HMMs built to model the non-wakeword speech characteristics, which are named filler models. Viterbi decoding is used to search the best path in the decoding graph, and the decoding output is further processed to make the decision on wakeword presence. This approach can be extended to include discriminative information by incorporating a hybrid DNN-HMM decoding framework. In another example, the wakeword detection component 620 may be built on deep neural network (DNN)/recursive neural network (RNN) structures directly, without HMM being involved. Such an architecture may estimate the posteriors of wakewords with context data, either by stacking frames within a context window for DNN, or using RNN. Follow-on posterior threshold tuning or smoothing is applied for decision making. Other techniques for wakeword detection, such as those known in the art, may also be used.
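
For the DNN/RNN variant, the follow-on posterior smoothing and thresholding might be sketched as follows (per-frame wakeword posteriors from a hypothetical model are assumed as input):

    import numpy as np

    def wakeword_decision(posteriors, window=30, threshold=0.8):
        # Smooth per-frame wakeword posteriors with a moving average, then
        # declare detection if the smoothed score crosses the threshold.
        if len(posteriors) < window:
            return False
        kernel = np.ones(window) / window
        smoothed = np.convolve(posteriors, kernel, mode="valid")
        return bool(np.any(smoothed > threshold))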

Once the wakeword detection component 620 detects a wakeword, the device 110 may “wake” and the wakeword detection component 620 may send an indication of such detection to a hybrid selector 624. In response to receiving the indication, the hybrid selector 624 may send the audio data 611 to the system 120 and/or an ASR component 650 implemented by the device 110. The wakeword detection component 620 may also send an indication, to the hybrid selector 624, representing a wakeword was not detected. In response to receiving such an indication, the hybrid selector 624 may refrain from sending the audio data 611 to the system 120, and may prevent the ASR component 650 from processing the audio data 611. In this situation, the audio data 611 can be discarded.

The device 110 may conduct its own speech processing using on-device language processing components (such as an on-device SLU component, the ASR component 650, and/or a NLU component 660) similar to the manner discussed above with respect to the system-implemented SLU component, ASR component 150, and NLU component 160. The device 110 may also internally include, or otherwise have access to, other components such as one or more skills (including the skill 125), a user recognition component 695 (configured to process in a similar manner to the user recognition component 195 implemented by the system 120), a profile storage 610 (configured to store similar profile data to the profile storage 170 implemented by the system 120), a TTS component 680 (configured to process in a similar manner to the TTS component implemented by the system 120), a domain selection component 672 (configured to process in a similar manner to the domain selection component 172 implemented by the system 120), a domain component 675 (configured to process in a similar manner to the domain component 175 implemented by the system 120), a skill invocation component 680 (configured to process in a similar manner to the skill invocation component 180 implemented by the system 120), a context aggregation component 635 (configured to process in a similar manner to the context aggregation component 135 implemented by the system 120), a personalization identifier component 640 (configured to process in a similar manner to the personalization identifier component 140 implemented by the system 120), a personalization identifier cache 645 (configured to store similar data to the personalization identifier cache 145 implemented by the system 120), a skill configuration storage 655 (configured to store similar data to the skill configuration storage 155 implemented by the system 120), a personalization identifier storage 665 (configured to store similar data to the personalization identifier storage 165 implemented by the system 120), a supplemental content component 685 (configured to process in a similar manner to the supplemental content component 185 implemented by the system 120), and/or other components. In at least some embodiments, the storages of the device 110 may only store data for users specifically associated with the device 110.

In at least some embodiments, the on-device language processing components may not have the same capabilities as the language processing components implemented by the system 120. For example, the on-device language processing components may be configured to handle only a subset of the user inputs that may be handled by the system-implemented language processing components. For example, such subset of user inputs may correspond to local-type user inputs, such as those controlling devices or components associated with a user's home. In such circumstances the on-device language processing components may be able to more quickly interpret and respond to a local-type user input, for example, than processing that involves the system 120. If the device 110 attempts to process a user input for which the on-device language processing components are not necessarily best suited, the one or more NLU hypotheses, determined by the on-device components, may have a low confidence or other metric indicating that the processing by the on-device language processing components may not be as accurate as the processing done by the system 120.

The hybrid selector 624, of the device 110, may include a hybrid proxy (HP) 626 configured to proxy traffic to/from the system 120. For example, the HP 626 may be configured to send messages to/from a hybrid execution controller (HEC) 627 of the hybrid selector 624. For example, command/directive data received from the system 120 can be sent to the HEC 627 using the HP 626. The HP 626 may also be configured to allow the audio data 611 to pass to the system 120 while also receiving (e.g., intercepting) this audio data 611 and sending the audio data 611 to the HEC 627.

In at least some embodiments, the hybrid selector 624 may further include a local request orchestrator (LRO) 628 configured to notify the ASR component 650 about the availability of the audio data 611, and to otherwise initiate the operations of on-device language processing when the audio data 611 becomes available. In general, the hybrid selector 624 may control execution of on-device language processing, such as by sending “execute” and “terminate” events/instructions. An “execute” event may instruct a component to continue any suspended execution (e.g., by instructing the component to execute on a previously-determined intent in order to determine a directive). Meanwhile, a “terminate” event may instruct a component to terminate further execution, such as when the device 110 receives directive data from the system 120 and chooses to use that remotely-determined directive data.

Thus, when the audio data 611 is received, the HP 626 may allow the audio data 611 to pass through to the system 120 and the HP 626 may also input the audio data 611 to the ASR component 650 by routing the audio data 611 through the HEC 627 of the hybrid selector 624, whereby the LRO 628 notifies the ASR component 650 of the audio data 611. At this point, the hybrid selector 624 may wait for response data from either or both of the system 120 and the on-device language processing components. However, the disclosure is not limited thereto, and in some examples the hybrid selector 624 may send the audio data 611 only to the ASR component 650 without departing from the disclosure. For example, the device 110 may process the audio data 611 on-device without sending the audio data 611 to the system 120.

The ASR component 650 is configured to receive the audio data 611 from the hybrid selector 624 and to recognize speech in the audio data 611, and the NLU component 660 is configured to determine an intent from the recognized speech (and optionally one or more named entities), and to determine how to act on the intent by generating directive data (e.g., instructing a component to perform an action). In some cases, a directive may include a description of the intent (e.g., an intent to turn off {device A}). In some cases, a directive may include (e.g., encode) an identifier of a second device(s), such as kitchen lights, and an operation to be performed at the second device(s). Directive data may be formatted using JavaScript syntax or a JavaScript-based syntax, such as JSON. In at least some embodiments, a device-determined directive may be serialized, much like how remotely-determined directives may be serialized for transmission in data packets over the network(s) 199. In at least some embodiments, a device-determined directive may be formatted as a programmatic application programming interface (API) call with the same logical operation as a remotely-determined directive. In other words, a device-determined directive may mimic a remotely-determined directive by using a same, or a similar, format as the remotely-determined directive.
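
As a purely illustrative example of a JSON-serialized device-determined directive (the field names are hypothetical, not the actual directive schema):

    import json

    directive = {
        "header": {"namespace": "DeviceControl", "name": "TurnOff"},
        "payload": {"target": "kitchen lights"},
    }
    serialized = json.dumps(directive)  # ready for transmission or local dispatch
    print(serialized)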

A NLU hypothesis (output by the NLU component 660) may be selected as usable to respond to a user input, and local response data may be sent to the hybrid selector 624, such as a “ReadyToExecute” response. The hybrid selector 624 may then determine whether to use directive data from the on-device components to respond to the user input, to use directive data received from the system 120, assuming a remote response is even received (e.g., when the device 110 is able to access the system 120 over the network(s) 199), or to determine output data requesting additional information from the user 105.

The device 110 and/or the system 120 may associate a unique identifier with each user input. The device 110 may include the unique identifier when sending the audio data 611 to the system 120, and the response data from the system 120 may include the unique identifier to identify to which user input the response data corresponds.

In at least some embodiments, the device 110 may include one or more skills. The skill(s) installed on (or in communication with) the device 110 may include, without limitation, one or more smart home skills and/or a device control skill configured to control a second device(s), a music skill configured to output music, a navigation skill configured to output directions, a shopping skill configured to conduct an electronic purchase, and/or the like.

FIG. 7 is a block diagram conceptually illustrating a device 110 that may be used with the system 120. FIG. 8 is a block diagram conceptually illustrating example components of a remote device, such as the system 120, which may assist with ASR processing, NLU processing, etc., and a skill 125. A system (120/125) may include one or more servers. A “server” as used herein may refer to a traditional server as understood in a server/client computing structure, but may also refer to a number of different computing components that may assist with the operations discussed herein. For example, a server may include one or more physical computing components (such as a rack server) that are connected to other devices/components either physically and/or over a network and are capable of performing computing operations. A server may also include one or more virtual machines that emulate a computer system and are run on one or across multiple devices. A server may also include other combinations of hardware, software, firmware, or the like to perform operations discussed herein. The system 120 may be configured to operate using one or more of a client-server model, a computer bureau model, grid computing techniques, fog computing techniques, mainframe techniques, utility computing techniques, a peer-to-peer model, sandbox techniques, or other computing techniques.

Multiple systems (120/125) may be included in the system 100 of the present disclosure, such as one or more systems 120 for performing ASR processing, one or more systems 120 for performing NLU processing, one or more skills 125, etc. In operation, each of these systems may include computer-readable and computer-executable instructions that reside on the respective device (120/125), as will be discussed further below.

Each of these devices (110/120/125) may include one or more controllers/processors (704/804), which may each include a central processing unit (CPU) for processing data and computer-readable instructions, and a memory (706/806) for storing data and instructions of the respective device. The memories (706/806) may individually include volatile random access memory (RAM), non-volatile read only memory (ROM), non-volatile magnetoresistive memory (MRAM), and/or other types of memory. Each device (110/120/125) may also include a data storage component (708/808) for storing data and controller/processor-executable instructions. Each data storage component (708/808) may individually include one or more non-volatile storage types such as magnetic storage, optical storage, solid-state storage, etc. Each device (110/120/125) may also be connected to removable or external non-volatile memory and/or storage (such as a removable memory card, memory key drive, networked storage, etc.) through respective input/output device interfaces (702/802).

Computer instructions for operating each device (110/120/125) and its various components may be executed by the respective device's controller(s)/processor(s) (704/804), using the memory (706/806) as temporary “working” storage at runtime. A device's computer instructions may be stored in a non-transitory manner in non-volatile memory (706/806), storage (708/808), or an external device(s). Alternatively, some or all of the executable instructions may be embedded in hardware or firmware on the respective device in addition to or instead of software.

Each device (110/120/125) includes input/output device interfaces (702/802). A variety of components may be connected through the input/output device interfaces (702/802), as will be discussed further below. Additionally, each device (110/120/125) may include an address/data bus (724/824) for conveying data among components of the respective device. Each component within a device (110/120/125) may also be directly connected to other components in addition to (or instead of) being connected to other components across the bus (724/824).

Referring to FIG. 7, the device 110 may include input/output device interfaces 702 that connect to a variety of components such as an audio output component such as a speaker 712, a wired headset or a wireless headset (not illustrated), or other component capable of outputting audio. The device 110 may also include an audio capture component. The audio capture component may be, for example, a microphone 720 or array of microphones, a wired headset or a wireless headset (not illustrated), etc. If an array of microphones is included, approximate distance to a sound's point of origin may be determined by acoustic localization based on time and amplitude differences between sounds captured by different microphones of the array. The device 110 may additionally include a display 716 for displaying content. The device 110 may further include a camera 718.

Via antenna(s) 714, the input/output device interfaces 702 may connect to one or more networks 199 via a wireless local area network (WLAN) (such as WiFi) radio, Bluetooth, and/or wireless network radio, such as a radio capable of communication with a wireless communication network such as a Long Term Evolution (LTE) network, WiMAX network, 3G network, 4G network, 5G network, etc. A wired connection such as Ethernet may also be supported. Through the network(s) 199, the system may be distributed across a networked environment. The I/O device interface (702/802) may also include communication components that allow data to be exchanged between devices such as different physical servers in a collection of servers or other components.

The components of the device 110, the system 120, and/or a skill 125 may include their own dedicated processors, memory, and/or storage. Alternatively, one or more of the components of the device 110, the system 120, and/or a skill 125 may utilize the I/O interfaces (702/802), processor(s) (704/804), memory (706/806), and/or storage (708/808) of the device(s) 110, system 120, or the skill 125, respectively. Thus, the ASR component may have its own I/O interface(s), processor(s), memory, and/or storage; the NLU component may have its own I/O interface(s), processor(s), memory, and/or storage; and so forth for the various components discussed herein.

As noted above, multiple devices may be employed in a single system. In such a multi-device system, each of the devices may include different components for performing different aspects of the system's processing. The multiple devices may include overlapping components. The components of the device 110, the system 120, and a skill 125, as described herein, are illustrative, and may be located as a stand-alone device or may be included, in whole or in part, as a component of a larger device or system.

As illustrated in FIG. 9, multiple devices (110a-110j, 120, 125) may contain components of the system and the devices may be connected over a network(s) 199. The network(s) 199 may include a local or private network or may include a wide network such as the Internet. Devices may be connected to the network(s) 199 through either wired or wireless connections. For example, a speech-detection device 110a, a smart phone 110b, a smart watch 110c, a tablet computer 110d, a vehicle 110e, a display device 110f, a smart television 110g, a washer/dryer 110h, a refrigerator 110i, and/or a microwave 110j may be connected to the network(s) 199 through a wireless service provider, over a WiFi or cellular network connection, or the like. Other devices are included as network-connected support devices, such as the system 120, the skill(s) 125, and/or others. The support devices may connect to the network(s) 199 through a wired connection or wireless connection. Networked devices may capture audio using one or more built-in or connected microphones or other audio capture devices, with processing performed by ASR components, NLU components, or other components of the same device or another device connected via the network(s) 199, such as the ASR component 150, the NLU component 160, etc. of the system 120.

The concepts disclosed herein may be applied within a number of different devices and computer systems, including, for example, general-purpose computing systems, speech processing systems, and distributed computing environments.

The above aspects of the present disclosure are meant to be illustrative. They were chosen to explain the principles and application of the disclosure and are not intended to be exhaustive or to limit the disclosure. Many modifications and variations of the disclosed aspects may be apparent to those of skill in the art. Persons having ordinary skill in the field of computers and speech processing should recognize that components and process steps described herein may be interchangeable with other components or steps, or combinations of components or steps, and still achieve the benefits and advantages of the present disclosure. Moreover, it should be apparent to one skilled in the art that the disclosure may be practiced without some or all of the specific details and steps disclosed herein.

Aspects of the disclosed system may be implemented as a computer method or as an article of manufacture, such as a memory device or non-transitory computer readable storage medium. The computer readable storage medium may be readable by a computer and may comprise instructions for causing a computer or other device to perform processes described in the present disclosure. The computer readable storage medium may be implemented by a volatile computer memory, non-volatile computer memory, hard drive, solid-state memory, flash drive, removable disk, and/or other media. In addition, components of the system may be implemented in firmware or hardware, such as an acoustic front end (AFE), which comprises, among other things, analog and/or digital filters (e.g., filters configured as firmware to a digital signal processor (DSP)).

Conditional language used herein, such as, among others, “can,” “could,” “might,” “may,” “e.g.,” and the like, unless specifically stated otherwise, or otherwise understood within the context as used, is generally intended to convey that certain embodiments include, while other embodiments do not include, certain features, elements, and/or steps. Thus, such conditional language is not generally intended to imply that features, elements, and/or steps are in any way required for one or more embodiments or that one or more embodiments necessarily include logic for deciding, with or without other input or prompting, whether these features, elements, and/or steps are included or are to be performed in any particular embodiment. The terms “comprising,” “including,” “having,” and the like are synonymous and are used inclusively, in an open-ended fashion, and do not exclude additional elements, features, acts, operations, and so forth. Also, the term “or” is used in its inclusive sense (and not in its exclusive sense) so that when used, for example, to connect a list of elements, the term “or” means one, some, or all of the elements in the list.

Disjunctive language such as the phrase “at least one of X, Y, Z,” unless specifically stated otherwise, is understood with the context as used in general to present that an item, term, etc., may be either X, Y, or Z, or any combination thereof (e.g., X, Y, and/or Z). Thus, such disjunctive language is not generally intended to, and should not, imply that certain embodiments require at least one of X, at least one of Y, or at least one of Z to each be present.

As used in this disclosure, the term “a” or “one” may include one or more items unless specifically stated otherwise. Further, the phrase “based on” is intended to mean “based at least in part on” unless specifically stated otherwise.

What is claimed is:
1. A computer-implemented method comprising: receiving, from a device, first input audio data corresponding to a first spoken input; performing automatic speech recognition (ASR) processing using the first input audio data to determine an ASR hypothesis corresponding to the first spoken input; performing natural language understanding (NLU) processing using the ASR hypothesis to determine a NLU hypothesis including an intent corresponding to the first spoken input; determining a user profile identifier corresponding to the first input audio data; determining a skill identifier corresponding to a skill represented as enabled in a user profile associated with the user profile identifier, the skill being configured to process the NLU hypothesis and respond to the first spoken input; generating a first personalization identifier to be used by the skill to select supplemental content for output with respect to the NLU hypothesis corresponding to the user profile identifier, the first personalization identifier generated by applying a hash function to the user profile identifier, the skill identifier, and a timestamp corresponding to a present time; sending, to the skill, the NLU hypothesis and the first personalization identifier; receiving, from the skill and after sending the NLU hypothesis and the first personalization identifier: first output data generated using the NLU hypothesis, the first output data being responsive to the first spoken input, and second output data determined using the first personalization identifier, the second output data corresponding to supplemental content to be output without receiving a user input requesting output of the second output data; and sending the first output data and the second output data to the device for output.
2. The computer-implemented method of claim 1, further comprising: sending, in response to receiving the first input audio data, the user profile identifier to a context aggregation component; determining, by the context aggregation component and based on receiving the user profile identifier, that the skill is represented as enabled in the user profile; and sending, by the context aggregation component and based on determining the skill is represented as enabled, the skill identifier to a first component configured to generate the first personalization identifier, wherein determining that the skill is represented as enabled, sending the skill identifier to the first component, and determining the first personalization identifier are performed at least partially in parallel to performing the ASR processing.
3. The computer-implemented method of claim 1, further comprising: sending the NLU hypothesis to a domain component associated with the skill; sending, by the domain component and to a context aggregation component, a request for a personalization identifier associated with the skill; sending, by the context aggregation component, the request to a first component that generated the first personalization identifier; identifying, by the first component, the first personalization identifier in response to receiving the request; sending, by the first component, the first personalization identifier to the context aggregation component; sending, by the context aggregation component, the first personalization identifier to the domain component; and sending, by the domain component, the NLU hypothesis and the first personalization identifier to the skill.
4. The computer-implemented method of claim 1, further comprising: receiving second input audio data corresponding to a second spoken input; determining the second spoken input requests generation of new personalization identifiers associated with the user profile identifier; generating, based on determining the second spoken input requests generation of new personalization identifiers, a second personalization identifier using the user profile identifier, the skill identifier, and a second timestamp corresponding to receipt of the second input audio data; and ceasing, based on determining the second spoken input requests generation of new personalization identifiers, use of the first personalization identifier with respect to user inputs associated with the user profile identifier.

5. A computer-implemented method comprising: receiving first input data corresponding to a first user input; determining an intent corresponding to the first user input; determining a user profile identifier associated with the first input data; determining a first skill identifier corresponding to a first skill to execute in response to the first user input; generating a first personalization identifier to be used by the first skill to select supplemental content for output with respect to user inputs corresponding to the user profile identifier, the first personalization identifier generated using the user profile identifier, the first skill identifier, and a first timestamp; sending the intent to the first skill; sending the first personalization identifier to the first skill; receiving, from the first skill, first output data generated using the intent, the first output data being responsive to the first user input; receiving, from the first skill, second output data generated using the first personalization identifier, the second output data corresponding to supplemental content to be output without receiving a user input requesting output of the second output data; and causing the first output data and the second output data to be presented.
6. The computer-implemented method of claim 5, further comprising: receiving a second skill identifier corresponding to a second skill configured to select supplemental content for output with respect to user inputs associated with the user profile identifier; and generating a second personalization identifier by performing a hash function using the user profile identifier, the second skill identifier, and the first timestamp.
7. The computer-implemented method of claim 5, further comprising: receiving the first input data from a device; determining a device identifier corresponding to the device; and generating the first personalization identifier further using the device identifier.
8. The computer-implemented method of claim 5, further comprising: sending, in response to receiving the first input data, the user profile identifier to a context aggregation component; determining, by the context aggregation component, that the first skill is represented as enabled in a user profile corresponding to the user profile identifier; and sending, by the context aggregation component and based on determining the first skill is represented as enabled, the first skill identifier to a first component configured to generate the first personalization identifier, wherein determining that the first skill is represented as enabled, sending the first skill identifier to the first component, and determining the first personalization identifier are performed at least partially in parallel to determining the intent corresponding to the first user input.
9. The computer-implemented method of claim 8, further comprising: sending, by the context aggregation component and to the first component, a second timestamp representing receipt of a second user input requesting generation of new personalization identifiers associated with the user profile identifier.
10. The computer-implemented method of claim 5, further comprising: sending the intent to the first skill; receiving, from the first skill and after sending the intent, a request for a personalization identifier; and generating the first personalization identifier based on receiving the request.
11. The computer-implemented method of claim 5, further comprising: sending, by a domain component and to a context aggregation component, a request for a personalization identifier associated with the first skill; identifying, by the context aggregation component, the first personalization identifier in response to receiving the request; sending, by the context aggregation component and after identifying the first personalization identifier, the first personalization identifier to the domain component; and sending, by the domain component, the intent and the first personalization identifier to the first skill.
12. The computer-implemented method of claim 5, further comprising: receiving second input data corresponding to a second user input; determining the second user input requests generation of new personalization identifiers associated with the user profile identifier; generating, based on determining the second user input requests generation of new personalization identifiers, a second personalization identifier using the user profile identifier, the first skill identifier, and a second timestamp corresponding to receipt of the second input data; and ceasing, based on determining the second user input requests generation of new personalization identifiers, use of the first personalization identifier with respect to user inputs associated with the user profile identifier.
13. A computing system comprising: at least one processor; and at least one memory comprising instructions that, when executed by the at least one processor, cause the computing system to: receive first input data corresponding to a first user input; determine an intent corresponding to the first user input; determine a user profile identifier associated with the first input data; determine a first skill identifier corresponding to a first skill to execute in response to the first user input; generate a first personalization identifier to be used by the first skill to select supplemental content for output with respect to user inputs corresponding to the user profile identifier, the first personalization identifier generated using the user profile identifier, the first skill identifier, and a first timestamp; send the intent to the first skill; send the first personalization identifier to the first skill; receive, from the first skill, first output data generated using the intent, the first output data being responsive to the first user input; receive, from the first skill, second output data generated using the first personalization identifier, the second output data corresponding to supplemental content to be output without receiving a user input requesting output of the second output data; and cause the first output data and the second output data to be presented.
14. The computing system of claim 13, wherein the at least one memory further comprises instructions that, when executed by the at least one processor, further cause the computing system to: receive a second skill identifier corresponding to a second skill configured to select supplemental content for output with respect to user inputs associated with the user profile identifier; and generate a second personalization identifier by performing a hash function using the user profile identifier, the second skill identifier, and the first timestamp.
15. The computing system of claim 13, wherein the at least one memory further comprises instructions that, when executed by the at least one processor, further cause the computing system to: receive the first input data from a device; determine a device identifier corresponding to the device; and generate the first personalization identifier further using the device identifier.
16. The computing system of claim 13, wherein the at least one memory further comprises instructions that, when executed by the at least one processor, further cause the computing system to: send, in response to receiving the first input data, the user profile identifier to a context aggregation component; determine, by the context aggregation component, that the first skill is represented as enabled in a user profile corresponding to the user profile identifier; and send, by the context aggregation component and based on determining the first skill is represented as enabled, the first skill identifier to a first component configured to generate the first personalization identifier, wherein determining that the first skill is represented as enabled, sending the first skill identifier to the first component, and determining the first personalization identifier are performed at least partially in parallel to determining the intent corresponding to the first user input.
17. The computing system of claim 16, wherein the at least one memory further comprises instructions that, when executed by the at least one processor, further cause the computing system to: send, by the context aggregation component and to the first component, a second timestamp representing receipt of a second user input requesting generation of new personalization identifiers associated with the user profile identifier.

18. The computing system of claim 13, wherein the at least one memory further comprises instructions that, when executed by the at least one processor, further cause the computing system to: send the intent to the first skill; receive, from the first skill and after sending the intent, a request for a personalization identifier; and generate the first personalization identifier based on receiving the request.
19. The computing system of claim 13, wherein the at least one memory further comprises instructions that, when executed by the at least one processor, further cause the computing system to: send, by a domain component and to a context aggregation component, a request for a personalization identifier associated with the first skill; identify, by the context aggregation component, the first personalization identifier in response to receiving the request; send, by the context aggregation component and after identifying the first personalization identifier, the first personalization identifier to the domain component; and send, by the domain component, the intent and the first personalization identifier to the first skill.
20. The computing system of claim 13, wherein the at least one memory further comprises instructions that, when executed by the at least one processor, further cause the computing system to: receive second input data corresponding to a second user input; determine the second user input requests generation of new personalization identifiers associated with the user profile identifier; generate, based on determining the second user input requests generation of new personalization identifiers, a second personalization identifier using the user profile identifier, the first skill identifier, and a second timestamp corresponding to receipt of the second input data; and cease, based on determining the second user input requests generation of new personalization identifiers, use of the first personalization identifier with respect to user inputs associated with the user profile identifier.